Prince Joseph Erneszer Javier
Reynaldo Tugade, Jr.
This notebook explores events extracted from the GDELT dataset, a 542 GB compressed open-source index of the world's news media. Records and metadata are grouped into three tables, namely Events, Mentions, and the Global Knowledge Graph (GKG), stored as compressed CSV files separated by date and table type. Summarizing insights with primitive Python data structures alone is impractical at this scale, so this notebook leverages Dask to enable parallel and out-of-core computation. Using parallel computing, we collated measures of the theoretical impact that specific types of events have on the stability of a country, namely the events' sentiment scores (AvgTone) and Goldstein Scale values. Results show that over the three months of May to July 2017, some countries in South America, Northern Africa, the Middle East, and Central Asia experienced the most adverse events from a regional-stability perspective.
The GDELT Dataset
The Global Database of Events, Language, and Tone, known as GDELT, is a CAMEO-coded dataset containing geo-located events with global coverage from 1979 to the present. Each record consists of two actors and the action performed by Actor1 upon Actor2. Additionally, as stated in the GDELT codebook, a wide array of variables breaks the raw CAMEO actor codes into their component fields to make the data easier to work with. Action codes are broken out into their hierarchy, with Goldstein ranking scores included, and a unique array of georeferencing fields offers estimated landmark-centroid-level geographic positioning of both actors and of the action's location. Lastly, a "Mentions" table records the network trajectory of each event's story "in flight" through the global media system [1]. In this notebook, we aim to answer which countries were least stable during the period of May to July 2017 based on the Goldstein scores and tone of the events in those countries.
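The actor-action-actor shape of a GDELT Events record can be sketched as a plain dictionary. This is only an illustration: the field values below are hypothetical, not taken from the dataset.

```python
# Illustrative sketch of one GDELT Events record (hypothetical values).
# Actor1 performs the action described by the CAMEO event code upon Actor2.
event = {
    "GlobalEventID": 123456789,       # unique identifier in the master dataset
    "Actor1Name": "UNITED STATES",    # actor performing the action
    "Actor2Name": "UNITED NATIONS",   # actor the action is performed upon
    "EventRootCode": "04",            # CAMEO root code ("Consult")
    "GoldsteinScale": 1.9,            # -10 (conflict) to +10 (cooperation)
    "AvgTone": -2.3,                  # -100 (negative) to +100 (positive)
}
```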
Goldstein Scale and Average Tone
The Goldstein Scale measures the intensity of conflict or cooperation implied by the type of event [2]. The value ranges from -10 to +10: the more negative the value, the more intense the conflict, while a more positive value indicates stronger cooperation. Note that the score depends only on the type of event (e.g., military attack), not on its specifics. Average Tone is the sentiment of the event based on the words used in the documents that report it. This value ranges from -100 for extremely negative events to +100 for extremely positive events.
Dask for Big Data Processing
With this dataset's records coming from a multitude of news sources and stored as a set of compressed files, the average personal computer would have difficulty processing it quickly [3]. An analysis would take considerable time using only native Python data structures, so tooling that can exploit multiple cores simultaneously becomes very important. Dask, a specification for encoding parallel algorithms using familiar Python callables, extends Python's capacity to parallelize complex codebases. It can significantly cut the time needed to explore large amounts of data by effectively managing disk usage and task scheduling. This notebook shows how Dask can be used to quickly extract summarized insights from a large dataset.
This notebook explores the Goldstein and Average Tone values of global and Philippine events from May to July 2017. The methods used include Dask for loading and processing the data with distributed workers (external computers), Pandas for loading the resulting smaller datasets and saving them to CSV files, Matplotlib for static visualizations, and Plotly for interactive visualizations.
Since the GDELT database is massive, we focused on specific attributes of interest. We are interested in the impact of events on the stability of a country or region. Of the three tables GDELT provides, we concentrated on the Events table, which contains the following information:
Table 1. Specific GDELT attributes chosen for analysis
| Name | Type | Description |
|---|---|---|
| GlobalEventID | Integer | Globally unique identifier assigned to each event record that uniquely identifies it in the master dataset |
| Day | Integer | Date the event took place, in YYYYMMDD format |
| MonthYear | Integer | Alternative formatting of the event date, in YYYYMM format |
| Year | Integer | Alternative formatting of the event date, in YYYY format |
| Actor1Code | String | The complete raw CAMEO code for Actor1 (includes geographic, class, ethnic, religious, and type classes). May be blank if the system was unable to identify an Actor1 |
| Actor1Name | String | The actual name of Actor1. For a political leader or organization, this will be the formal name (GEORGE W BUSH, UNITED NATIONS); for a geographic match it will be the country or capital/major city name (UNITED STATES / PARIS); and for ethnic, religious, and type matches it will reflect the root match class (KURD, CATHOLIC, POLICE OFFICER, etc.) |
| Actor1CountryCode | String | The 3-character CAMEO code for the country affiliation of Actor1 |
| GoldsteinScale | Float | Each CAMEO event code is assigned a numeric score from -10 to +10, capturing the theoretical potential impact that type of event will have on the stability of a country |
| NumMentions | Integer | Total number of mentions of this event across all source documents during the 15-minute update in which it was first seen |
| AvgTone | Numeric | Average "tone" of all documents containing one or more mentions of this event during the 15-minute update in which it was first seen |
These attributes are stored in compressed files sorted by date. Each file is zip-compressed and grouped by table type (e.g., export, mentions, GKG).
We performed an exploratory data analysis of the Goldstein Scale and Average Tone of global events from May to July 2017. In summary, we visualized scatterplots of the events worldwide and in the Philippines, plotted the top ten countries with the most positive and most negative Goldstein Scale and Average Tone values, and compared these values with the global and Philippine averages. When ranking the countries, those with a number of events below the 10th percentile were removed from the dataset, as their event counts were deemed too few to estimate the general stability of the region.
The main insights from this exploratory analysis are:
- Countries in South America (notably Venezuela), Northern Africa, the Middle East, and Central Asia (notably Afghanistan) had the most negative Goldstein and AvgTone scores during the period.
- The Philippines scored more negatively than the global average on both measures, with negative events concentrated in the Visayas and Mindanao during the Marawi siege.
We first loaded the packages that we needed.
# importing packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import glob
import dask.dataframe as dd
import dask.bag as db
from dask.delayed import delayed
from dask.distributed import Client
from joblib import parallel_backend  # sklearn.externals.joblib was removed in recent scikit-learn
from dask.diagnostics import ProgressBar
from plotly.offline import plot, iplot, init_notebook_mode
import plotly.graph_objs as go
init_notebook_mode()
We connected to the Dask cluster.
# set to run Dask commands in this "cluster"
client = Client('10.233.29.219:8786')
We checked the contents of the folder gdeltv2. The folder contains mentions.CSV.zip, gkg.csv.zip, and export.CSV.zip. From reading the GDELT documentation, the features we need are in the export.CSV.zip files.
# check the first five contents of the folder
path = '/mnt/data/public/gdeltv2/*'
glob.glob(path)[:5]
We checked the contents of one export.CSV.zip file. We found the data mentioned in the GDELT documentation but there were no column headers.
# we see three kinds of files
# let's open the contents one by one
# we define sample sets
f2 = ['/mnt/data/public/gdeltv2/20170611004500.export.CSV.zip']
# we import the progress bar
# pbar = ProgressBar()
# pbar.register()
# we load export.CSV.zip into a delayed Pandas dataframe
dfs = [delayed(pd.read_csv)(fn, delimiter='\t', header=None,
dtype='str', engine='python') for fn in f2]
df = dd.from_delayed(dfs)
print(f2)
display(df.head().T)
We defined the column headers listed in the GDELT documentation.
# We see that there are no columns in the dataset
# We found the columns in GDELT website
events_columns = ['GlobalEventID', 'Day', 'MonthYear', 'Year', 'FractionDate',
'Actor1Code', 'Actor1Name', 'Actor1CountryCode',
'Actor1KnownGroupCode', 'Actor1EthnicCode',
'Actor1Religion1Code', 'Actor1Religion2Code',
'Actor1Type1Code', 'Actor1Type2Code', 'Actor1Type3Code',
'Actor2Code', 'Actor2Name', 'Actor2CountryCode',
'Actor2KnownGroupCode', 'Actor2EthnicCode',
'Actor2Religion1Code', 'Actor2Religion2Code',
'Actor2Type1Code', 'Actor2Type2Code', 'Actor2Type3Code',
'IsRootEvent', 'EventCode', 'EventBaseCode',
'EventRootCode', 'QuadClass', 'GoldsteinScale',
'NumMentions', 'NumSources', 'NumArticles', 'AvgTone',
'Actor1Geo_Type', 'Actor1Geo_Fullname',
'Actor1Geo_CountryCode', 'Actor1Geo_ADM1Code',
'Actor1Geo_ADM2Code', 'Actor1Geo_Lat', 'Actor1Geo_Long',
'Actor1Geo_FeatureID', 'Actor2Geo_Type',
'Actor2Geo_Fullname', 'Actor2Geo_CountryCode',
'Actor2Geo_ADM1Code', 'Actor2Geo_ADM2Code',
'Actor2Geo_Lat', 'Actor2Geo_Long', 'Actor2Geo_FeatureID',
'ActionGeo_Type', 'ActionGeo_Fullname',
'ActionGeo_CountryCode', 'ActionGeo_ADM1Code',
'ActionGeo_ADM2Code', 'ActionGeo_Lat', 'ActionGeo_Long',
'ActionGeo_FeatureID', 'DATEADDED', 'SOURCEURL']
We defined a function that loads the contents of a set of file paths and returns a Dask dataframe. The preprocessing steps performed by the function are:
- load each zipped CSV into a delayed Pandas dataframe and combine them;
- drop rows with null values in the key fields;
- strip stray non-numeric characters and convert the numeric columns to float;
- select only the columns needed for the analysis.
# We are ready to load a larger dataset
# we register the progress bar
pbar = ProgressBar()
pbar.register()

def load_events(filenames):
    '''
    Load events data from a list of filenames.
    Select necessary columns, drop null values,
    convert numerical values to float, and
    return the cleaned dask dataframe.
    '''
    # we load each export.CSV.zip into a delayed Pandas dataframe
    dfs_events = [delayed(pd.read_csv)(fn, delimiter='\t', header=None,
                                       dtype='str', names=events_columns,
                                       engine='python') for fn in filenames]
    df_events = dd.from_delayed(dfs_events).set_index('GlobalEventID')
    # Drop null values
    df_events = df_events.dropna(
        subset=['GoldsteinScale', 'NumMentions', 'AvgTone', 'Actor1Geo_Lat',
                'Actor1Geo_Long', 'Actor1Geo_CountryCode', 'Actor1Geo_Fullname'])
    print("> Null values dropped.")
    # Numerical columns to clean
    to_clean = ['Day', 'MonthYear', 'GoldsteinScale', 'NumMentions',
                'NumSources', 'NumArticles', 'AvgTone',
                'Actor1Geo_Lat', 'Actor1Geo_Long']
    # Numerical columns to convert (lat and long return an error for some reason)
    to_conv = ['Day', 'MonthYear', 'GoldsteinScale', 'NumMentions',
               'NumSources', 'NumArticles', 'AvgTone']
    # Clean numerical columns by removing stray '#' characters
    for col in to_clean:
        df_events[col] = df_events[col].str.strip().str.replace('#', '',
                                                                regex=False)
    print("> Removed non-numerical values from numerical columns.")
    # Convert to numerical data
    for col in to_conv:
        df_events[col] = df_events[col].astype(float)
    print("> Converted numerical data to float.")
    # Columns to keep for the per-country Goldstein and AvgTone analysis
    keep_cols = ['Day', 'MonthYear', 'GoldsteinScale', 'NumMentions',
                 'NumSources', 'NumArticles', 'AvgTone', 'Actor1Geo_Lat',
                 'Actor1Geo_Long', 'Actor1Geo_CountryCode',
                 'Actor1Geo_Fullname']
    # Extract the needed data
    df_ = df_events[keep_cols]
    print("> Selected needed data from Events data.")
    print(f"> Type of dataframe: {type(df_)}")
    return df_
Since we were only concerned with dates from May to July 2017, only the .export.CSV.zip files with filenames starting with 201705, 201706, or 201707 were loaded.
# load the dataset (May, June, and July 2017 files)
f_events = glob.glob('/mnt/data/public/gdeltv2/20170[567]*.export.CSV.zip')
df_ = load_events(f_events)
As another filter and to ensure that only data with dates from May to July 2017 were selected, only rows with MonthYear values of 201705, 201706, or 201707 were selected.
# Keep only rows with MonthYear values for May to July 2017
df_ = df_[df_["MonthYear"].isin([201705, 201706, 201707])]
New columns containing the number of mentions per event multiplied by the Goldstein Scale or AvgTone were created. These will be used for calculating the average Goldstein Scale and AvgTone per country, given by:
\begin{equation} AvgScore = \frac{\sum(NumMentions \times Score)}{\sum NumMentions} \end{equation}
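A tiny worked example of this mentions-weighted average, using two hypothetical events (one with 10 mentions, one with 30), may make the formula concrete:

```python
# Mentions-weighted average on two hypothetical events.
import pandas as pd

events = pd.DataFrame({"NumMentions": [10, 30],
                       "GoldsteinScale": [-4.0, 2.0]})
weighted = ((events["NumMentions"] * events["GoldsteinScale"]).sum()
            / events["NumMentions"].sum())
# (10 * -4 + 30 * 2) / (10 + 30) = 20 / 40 = 0.5
```

The more-mentioned event pulls the average toward its score, which is the intent: heavily covered events count for more.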
# Average Goldstein and Avg Tone
# Average Goldstein score per country weighted by number of mentions (importance)
# New column added containing the GoldsteinScale and Avg Tone * num Mentions
df_['goldstein * num_mentions'] = df_['GoldsteinScale']*df_['NumMentions']
df_['avgtone * num_mentions'] = df_['AvgTone']*df_['NumMentions']
print("> New weighted Goldstein Scale column created.")
print("> New weighted avgtone column created.")
All data selected were then grouped according to location and by month. The values in each feature were summed up by group.
# Group by country and compute
df_by_country_month = df_[['Actor1Geo_CountryCode', 'MonthYear', 'NumMentions', 'goldstein * num_mentions',
'avgtone * num_mentions']].groupby(by=['Actor1Geo_CountryCode', 'MonthYear']).sum().compute()
print("> Data grouped by Actor1Geo_CountryCode and computed (sum) successfully.")
The table below shows the first five results of the grouped values.
# Head of summed values by country by month
df_by_country_month.head()
The Average Tone and Average Goldstein Scale per country were then calculated following the equation above. The resulting dataset was saved in a csv file for quick reference.
# Get average goldstein and average tone per country per month
# Note: Average Goldstein and Avg Tone = sum(num mentions * score) / sum(num_mentions per country)
df_by_country_month['avg_goldstein'] = df_by_country_month['goldstein * num_mentions'] / \
df_by_country_month['NumMentions']
df_by_country_month['avg_avgtone'] = df_by_country_month['avgtone * num_mentions'] / \
df_by_country_month['NumMentions']
df_by_country_month.to_csv("data/df_by_country_month.csv", index=True)
print("> Successfully computed average tone and goldstein per country per month")
print("> Successfully saved dataset to csv")
We loaded the previously saved dataset into a pandas dataframe.
df_by_country_month = pd.read_csv("data/df_by_country_month.csv")
print("Loaded df_by_country_month csv file.")
Below are the first five rows in the loaded dataset.
df_by_country_month.head()
We calculated the Average Tone and Average Goldstein Values per location over the whole 3-month period.
# Get the average values per country for the whole scope of date
df_by_country = df_by_country_month.groupby("Actor1Geo_CountryCode").sum()
df_by_country['avg_goldstein'] = df_by_country['goldstein * num_mentions'] / \
df_by_country['NumMentions']
df_by_country['avg_avgtone'] = df_by_country['avgtone * num_mentions'] / \
df_by_country['NumMentions']
print("> Calculated avg_goldstein and avg_avgtone per country for the whole duration considered")
Since the locations are in FIPS format, we loaded a dictionary containing the location name per FIPS value. The dictionary doesn't contain DA, WI, and YI, so these three were added.
# load country codes dictionary
country_codes = dict(pd.read_csv('fips.csv', index_col='Code').T)
# These are locations not in the dictionary
country_codes["DA"] = ["Denmark"]
country_codes["WI"] = ["Western Sahara"]
country_codes["YI"] = ["Serbia and Montenegro"]
The locations with events having a total number of mentions in the bottom 10 percentile were removed since these events were deemed too few to give a general sense of the regions' stability.
# Only get the number of mentions above the 10th percentile
# This will filter out the least important 10% of events
lowest10 = df_by_country.NumMentions.quantile(0.1)
df_by_country = df_by_country[df_by_country.NumMentions >= lowest10]
print(
f"> Removed NumMentions less than {lowest10}: least important for plotting")
The function below plots a bar chart of the ten countries with the highest and the ten with the lowest Goldstein or AvgTone values.
def plot_barchart(df, value):
    """Plot a bar chart of a value (column name) per location.

    Relies on the global `world` average computed before calling.
    """
    print(f"> {len(df)} locations in the dataset")
    # Sort the countries by the chosen weighted score
    to_sort = value
    df_sorted = df.sort_values(
        by=to_sort, ascending=False).reset_index(drop=False)
    print(f"> Sorted according to {to_sort} and reset index")
    # top 10 and bottom 10 countries
    top = df_sorted.iloc[:10, :]
    bottom = df_sorted.iloc[-10:, :]
    # Philippine value
    ph = df_sorted[df_sorted.Actor1Geo_CountryCode == 'RP']
    print("> loaded Philippines value")
    country_names_top = [country_codes[i][0]
                         for i in top.Actor1Geo_CountryCode]
    country_names_bottom = [country_codes[i][0]
                            for i in bottom.Actor1Geo_CountryCode]
    # plot global, Philippine, top, and bottom values
    plt.barh('Global Average', world)
    plt.barh('Philippines', ph[to_sort])
    plt.barh(country_names_top, top[to_sort])
    plt.barh(country_names_bottom, bottom[to_sort])
    plt.yticks(rotation=0)
    plt.xlabel(f'{value} Scale')
    plt.ylabel("Country")
    plt.title(f'Countries with Lowest and Highest {value}')
    plt.tight_layout()
    plt.savefig(f'charts/bar_{value}.png', dpi=150)
The global Goldstein Average was calculated to be 0.49.
# Get the global mean of goldstein score
world = np.sum(df_by_country_month["goldstein * num_mentions"]
) / np.sum(df_by_country_month["NumMentions"])
print(f"> loaded global value: {world}")
The chart below shows the countries with the most positive and most negative Goldstein scores.
Among the ten countries with most negative Goldstein Scores, two are from Africa, six are from the Middle East and Central Asia, and the other two are Venezuela and Serbia and Montenegro. The country with the most negative Goldstein Score is Central African Republic at -1.80. The global Goldstein average is slightly positive at around 0.49 while the average for the Philippines is slightly negative at -0.27.
# Plot avg goldstein for the whole 3 month period May June Jul 2017
plot_barchart(df_by_country, 'avg_goldstein')
The global Average Tone value is -2.05.
# Get the global mean of avgtone
world = np.sum(df_by_country_month["avgtone * num_mentions"]
) / np.sum(df_by_country_month["NumMentions"])
print(f"> loaded global value: {world}")
The chart below shows the countries with the most positive and most negative Average Tone values.
Among the ten countries with the most negative Average Tone, two are from Africa, three are from the Middle East, and three are from Europe. The country with the most negative Average Tone is Venezuela at -5.55. The global average tone is -2.05 while the average tone in the Philippines is -2.97.
# Plot avg tone for the whole 3 month period May June Jul 2017
plot_barchart(df_by_country, 'avg_avgtone')
In this section, we plotted a scatterplot of Goldstein and AvgTone values of 1% of all the events from May to July 2017.
We first obtained a sample of 1% of the dataset which would be plotted in a scatterplot.
# Get a sample for plotting
frac = 0.01
df_events_sample = df_.sample(frac=frac).persist()
print(f"> Obtained a {frac} sample for plotting.")
print("> Selected data persisted into workers.")
The sampled dataset was saved in a csv for quick reference.
# Save sample to csv
df_events_sample.compute().to_csv("data/df_events_sample_coord.csv")
print("> df_events_sample_coord saved to csv")
The dataset saved above was then loaded in a dataframe.
# Load df_events_sample.csv
df_events_sample = pd.read_csv("data/df_events_sample_coord.csv")
df_events_sample.head()
The chart below shows the distribution of Avg Tone values in the sampled dataset. The global average value in the sample was found to be -2.01.
plt.figure(figsize=(5,4))
plt.hist(df_events_sample.AvgTone, bins=30);
plt.title("Histogram of AvgTone in the Sampled Dataset")
plt.ylabel("Counts of Events")
plt.xlabel("AvgTone Value")
plt.tight_layout()
plt.savefig("charts/histogram_avgtone.png", dpi=150)
print(f"> Mean of AvgTone in the sampled dataset: {np.mean(df_events_sample.AvgTone)}")
The chart below shows the distribution of Goldstein values in the sampled dataset. The global average value in the sample was found to be 0.56.
plt.figure(figsize=(5,4))
plt.hist(df_events_sample.GoldsteinScale, bins=30);
plt.title("Histogram of Goldstein Scale in the Sampled Dataset")
plt.ylabel("Counts of Events")
plt.xlabel("Goldstein Value")
plt.tight_layout()
plt.savefig("charts/histogram_goldstein.png", dpi=150)
print(f"> Mean of Goldstein in the sampled dataset: {np.mean(df_events_sample.GoldsteinScale)}")
The longitudes and latitudes of each event and the corresponding Goldstein and Avg Tone values were extracted from the dataset.
# Plot the longitudes and latitudes color coded according to Goldstein value
y = df_events_sample['Actor1Geo_Lat']
x = df_events_sample['Actor1Geo_Long']
goldstein = df_events_sample['GoldsteinScale']
avgtone = df_events_sample['AvgTone']
num_mentions = df_events_sample['NumMentions']
print("> Coordinates and other valuable data to be visualized computed successfully.")
The events were plotted according to coordinates and color coded by Goldstein Score (red being most negative, green being most positive, and yellow as most neutral). The size of each marker represents the number of mentions of the event (importance). Although negative values are generally present globally, there are observable prominent red patches in some parts of the US, South America, Middle East, and Africa.
# Goldstein values vs Latitude and Longitude
# Color is goldstein while size is importance (num_mentions)
plt.style.use('default')
f, ax = plt.subplots(figsize=(11,5))
ax.scatter(x, y, c=goldstein, marker='o', s=num_mentions/3, cmap='RdYlGn', alpha=0.75)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.axis('equal')
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.title('Goldstein score per event from May to July 2017')
plt.tight_layout()
# Save figure
plt.savefig("charts/scatter_goldstein_world.png", dpi=150)
To see the locations of the 10% most positive and 10% most negative events, we filtered the events with Goldstein values below the 10th percentile and above the 90th percentile.
# Plot the longitudes and latitudes of the top/bottom percentile of a value
def plot_percentiles(name, df, value, percentile):
    """Given a dataframe, a value column (str), and a percentile,
    plot the events in the top and bottom percentile."""
    # Filter values at or below the bottom percentile
    df_red1 = df[df[value] <= df[value].quantile(percentile)]
    y_red1 = df_red1['Actor1Geo_Lat']
    x_red1 = df_red1['Actor1Geo_Long']
    num_mentions_red1 = df_red1['NumMentions']
    print(f"> {value} values in the bottom {percentile} calculated successfully.")
    # Filter values at or above the top percentile
    df_green1 = df[df[value] >= df[value].quantile(1 - percentile)]
    y_green1 = df_green1['Actor1Geo_Lat']
    x_green1 = df_green1['Actor1Geo_Long']
    num_mentions_green1 = df_green1['NumMentions']
    print(f"> {value} values in the top {percentile} calculated successfully.")
    print("> Coordinates and other valuable data to be visualized computed successfully.")
    # Values vs latitude and longitude
    # Color marks the top/bottom group; size is importance (NumMentions)
    plt.style.use('default')
    f, ax = plt.subplots(figsize=(11, 5))
    ax.scatter(x_red1, y_red1, c='r', marker='o', s=num_mentions_red1 / 3,
               alpha=1., label=f'bottom {percentile*100:.0f}%')
    ax.scatter(x_green1, y_green1, c='g', marker='o', s=num_mentions_green1 / 3,
               alpha=0.5, label=f'top {percentile*100:.0f}%')
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    plt.axis('equal')
    plt.xlabel('longitude')
    plt.ylabel('latitude')
    plt.title(f'{percentile*100:.0f}% Most Positive and Negative {value} '
              'scores per location from May to July 2017')
    plt.legend()
    plt.tight_layout()
    # Save figure
    plt.savefig(f"charts/scatter_{value}_{name}_top_bottom.png", dpi=150)
The chart below shows the 10% of events with the most negative and the 10% with the most positive Goldstein values. Notable is the distribution of events in the Middle East: some areas there have markedly more negative Goldstein values than positive ones.
plot_percentiles('world', df_events_sample, 'GoldsteinScale', 0.1)
Similarly, the events were plotted according to coordinates and color coded by Avg Tone (red being most negative, green being most positive, and yellow as neutral). The size of each marker represents the number of mentions of the event (importance). In general, the average tone distribution looks even globally.
# AvgTone values vs Latitude and Longitude
# Color is avg tone while size is importance (num_mentions)
f, ax = plt.subplots(figsize=(11,5))
ax.scatter(x, y, c=avgtone, marker='o', s=num_mentions/3, cmap='RdYlGn', alpha=0.75)
plt.axis('equal')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.title('Average Tone per event from May to July 2017')
plt.tight_layout()
# Save figure
plt.savefig("charts/scatter_avgtone_world.png", dpi=150)
The chart below shows the 10% of events with the most negative and the 10% with the most positive AvgTone values. Notable is the distribution of events in the Middle East and Africa: some areas in those regions have markedly more negative AvgTone values than positive ones.
plot_percentiles('world', df_events_sample, 'AvgTone', 0.1)
Below, we visualized choropleth maps of the Goldstein Scores and AvgTone values globally.
df = pd.read_csv('data/df_by_country_month.csv')
df_data = df[['Actor1Geo_CountryCode', 'MonthYear','avg_goldstein', 'avg_avgtone']]
df_country_map = pd.read_csv('countrymap.txt',sep='\t')
df_country_map.columns = ['Country', 'Actor1Geo_CountryCode', '3let']
def data_generator(time_delta_list, df, df_country_map, column_to_plot,
                   colorbar_title):
    """Build one choropleth trace per MonthYear in time_delta_list."""
    df_data_superset = df[['Actor1Geo_CountryCode', 'MonthYear',
                           'avg_goldstein', 'avg_avgtone']]
    df_country_map.columns = ['Country', 'Actor1Geo_CountryCode', '3let']
    data = []
    for every_timedelta in time_delta_list:
        df_data = df_data_superset.query('MonthYear==' + str(every_timedelta))
        df_to_plot = pd.merge(df_country_map, df_data, how='left',
                              on=['Actor1Geo_CountryCode'])
        df_to_plot['avg_goldstein'] = df_to_plot['avg_goldstein'].fillna(0)
        df_to_plot['avg_avgtone'] = df_to_plot['avg_avgtone'].fillna(0)
        data.append(dict(
            visible=False,
            type='choropleth',
            locations=df_to_plot['3let'],
            z=df_to_plot[column_to_plot],
            text=df_to_plot['Country'],
            colorscale=[[0.0, 'rgb(165,0,38)'],
                        [0.1111111111111111, 'rgb(215,48,39)'],
                        [0.2222222222222222, 'rgb(244,109,67)'],
                        [0.3333333333333333, 'rgb(253,174,97)'],
                        [0.4444444444444444, 'rgb(254,224,144)'],
                        [0.5555555555555556, 'rgb(224,243,248)'],
                        [0.6666666666666666, 'rgb(171,217,233)'],
                        [0.7777777777777778, 'rgb(116,173,209)'],
                        [0.8888888888888888, 'rgb(69,117,180)'],
                        [1.0, 'rgb(49,54,149)']],
            autocolorscale=False,
            reversescale=False,
            marker=dict(line=dict(color='rgb(180,180,180)', width=0.5)),
            colorbar=dict(autotick=False, tickprefix='',
                          title=colorbar_title),
        ))
    return data
def draw_choropleth(data_, title, df, filename):
    """Draw the choropleth traces with a slider stepping through MonthYears."""
    data_[0]['visible'] = True
    steps = []
    for i in range(len(data_)):
        step = dict(
            method='restyle',
            args=['visible', [False] * len(data_)],
            label=sorted(df['MonthYear'].unique())[i],
        )
        step['args'][1][i] = True
        steps.append(step)
    sliders = [dict(
        active=0,
        currentvalue={"prefix": "YearMonth: "},
        pad={"t": 10},
        steps=steps,
    )]
    layout = dict(
        autosize=False,
        width=1000,
        height=600,
        title=title,
        dragmode='pan',
        geo=dict(
            showframe=False,
            showcoastlines=False,
            projection=dict(type='Mercator'),
        ),
        sliders=sliders,
    )
    fig = dict(data=data_, layout=layout)
    iplot(fig, validate=False, filename=filename)
Notable regions with the most negative Goldstein values are South America (particularly Venezuela), African countries like Somalia, and the Middle East and Central Asia, especially Afghanistan.
data_ = data_generator(df=df, df_country_map=df_country_map, time_delta_list=sorted(
df['MonthYear'].unique()), column_to_plot='avg_goldstein', colorbar_title='Goldstein value')
draw_choropleth(data_, 'Monthly Global Goldstein Value',
                df, 'goldstein-world-map')
Notable regions with the most negative AvgTone values are South America (particularly Venezuela), African countries (particularly the Central African Republic and Congo), and the Middle East, including Libya, Egypt, and Iraq.
data_ = data_generator(df=df, df_country_map=df_country_map, time_delta_list=sorted(
df['MonthYear'].unique()), column_to_plot='avg_avgtone', colorbar_title='AvgTone value')
draw_choropleth(data_, 'Monthly Global AvgTone Value', df, 'avgtone-world-map')
To see the events in the Philippines, we first filtered only the data for the Philippines from the whole dataset from May to July 2017.
# Filter Philippines dataframe
df_ph = df_[df_.Actor1Geo_CountryCode == 'RP']
print("> Selected Philippines and successfully computed")
We saved the filtered dataset to a csv file.
# Save to csv
df_ph.compute().to_csv("data/df_ph_coord.csv")
print("> Successfully saved Ph sample to csv")
We loaded the csv file into a dataframe.
# Load
df_ph = pd.read_csv("data/df_ph_coord.csv")
Below are the first five rows in the loaded dataset.
df_ph.head()
We selected the latitudes and longitudes to be plotted.
# Get Latitude and Longitude
y2 = df_ph['Actor1Geo_Lat']
x2 = df_ph['Actor1Geo_Long']
print("> Coordinates to be visualized computed successfully.")
We selected the Goldstein and AvgTone values to be plotted.
# Get goldstein and avgtone values
goldstein2 = df_ph['GoldsteinScale']
avgtone2 = df_ph['AvgTone']
num_mentions2 = df_ph['NumMentions']
print("> Other valuable data to be visualized computed successfully.")
Below is the plot of all Goldstein values of events in the Philippines during the three-month period.
# Goldstein values vs Latitude and Longitude
# Color is goldstein while size is importance (num_mentions)
plt.style.use('default')
f, ax = plt.subplots(figsize=(6,6))
ax.scatter(x2, y2, c=goldstein2, marker='o', s=num_mentions2/2, cmap='Spectral', alpha=0.75)
plt.axis('equal')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.title('Average Goldstein per event in the Ph from May - July 2017')
plt.tight_layout()
plt.savefig('charts/scatter_goldstein_ph.png', dpi=150)
Below is the plot of all AvgTone values of events in the Philippines during the three-month period.
# AvgTone values vs Latitude and Longitude
# Color is avg tone while size is importance (num_mentions)
f, ax = plt.subplots(figsize=(6,6))
ax.scatter(x2, y2, c=avgtone2, marker='o', s=num_mentions2/2, cmap='Spectral', alpha=0.75)
plt.axis('equal')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.title('Average Tone per event in the Ph from May - July 2017')
plt.tight_layout()
plt.savefig('charts/scatter_avgtone_ph.png', dpi=150)
From the chart below, we can't visually find a distinct concentration of negative or positive Goldstein values in an area.
plot_percentiles('ph', df_ph, 'GoldsteinScale', 0.1)
From the chart below, we can see a relatively high concentration of negative AvgTone compared with positive AvgTone in the Visayas and Mindanao regions.
plot_percentiles('ph', df_ph, 'AvgTone', 0.1)
We have seen how quickly we can extract insights from a large dataset by using Dask. For the three-month period of May to July 2017, several areas had events with a consistently negative impact on their general political stability. Countries like Venezuela, parts of Africa, and parts of the Middle East and Central Asia (notably Afghanistan) consistently had the most negative Goldstein and AvgTone scores. During these months, the events in those countries included terrorist attacks against public infrastructure [4], protests against the government [5], and war and conflict [6].

Among the African countries, Niger stands out as having the most positive Goldstein and AvgTone scores. These reflect generally positive events during the period, such as Niger's army rescuing 92 migrant workers who had been left abandoned [7]. Other positive incidents in the region, like freed captives [8] and environmental awareness campaigns [9], also show relatively higher scores, which suggests that such events can contribute a positive impact to the general stability of a region.

The Philippines had generally more negative Goldstein and AvgTone scores than the global average during those three months, with negative events more marked in Mindanao and the Visayas. Those months were the height of the Marawi siege, when terrorists took over the city of Marawi in Mindanao. The event prompted the proclamation of Martial Law in Mindanao and military intervention in the region, especially Marawi; the city was declared free on October 17, 2017 [10]. The crisis also prompted the Visayas to heighten security [11].

Further research may include expanding the scope of dates to one year or more, looking for trends in instability over that longer period, and correlating the instability of a region with other data like GDP per capita and exports.
[1] GDELT Event Codebook V2.0 [PDF]. (2015, September 2). Gdeltproject.org. http://gdeltproject.org/
[2] Goldstein Scale for WEIS Data. Retrieved from http://web.pdx.edu/~kinsella/jgscale.html
[3] Rocklin, M. (2015). Dask: Parallel Computation with Blocked Algorithms and Task Scheduling [PDF]. 14th Python in Science Conference (SciPy 2015).
[4] Michaelson, R. (2017, May 26). Egypt launches raids in Libya after attack on Coptic Christians kills 26. Retrieved from https://www.theguardian.com/world/2017/may/26/several-killed-in-attack-on-bus-carrying-coptic-christians-in-egypt
[5] A. (2017, May 20). Venezuela: 50th day of protests brings central Caracas to a standstill. Retrieved from https://www.theguardian.com/world/2017/may/20/venezuela-50th-day-of-protests-brings-central-caracas-to-a-standstill
[6] "The city of Bangassou has turned into a battlefield; we fear the worst for the civilian population". Retrieved from https://www.msf.org/central-african-republic-city-bangassou-has-turned-battlefield-we-fear-worst-civilian-population
[7] Telesur. (2017, June 14). Niger Army Rescue 92 Migrants Left for Dead in Sahara Desert. Retrieved from https://www.telesurenglish.net/news/Niger-Army-Rescue-92-Migrants-Left-for-Dead-in-Sahara-Desert-20170614-0011.html
[8] Busari, S., & Croft, J. (2017, May 08). 82 released Chibok schoolgirls arrive in capital. Retrieved from https://edition.cnn.com/2017/05/07/africa/chibok-girls-released/index.html
[9] Sebunya, K. (2017, July 31). Saving the world's wildlife is not just 'a white person thing'. Retrieved from https://www.theguardian.com/environment/africa-wild/2017/jul/31/saving-wildlife-conservation-africa-colonialism-race
[10] ABS-CBN. (2017, October 17). TIMELINE: The Battle for Marawi. Retrieved from https://news.abs-cbn.com/news/10/17/17/timeline-the-battle-for-marawi
[11] Mayol, A. et al. (2017, May 24). Alert up in Visayas amid Marawi crisis. Retrieved from https://newsinfo.inquirer.net/899312/alert-up-in-visayas-amid-marawi-crisis
We would like to acknowledge the Asian Institute of Management ACCESS Lab for the dataset, and Prof. Christian Alis for guidance.